Abstract: We propose a novel method of introducing
structure into existing machine learning techniques by 
developing structure-based similarity and distance mea- 
sures. To learn structural information, low-dimensional 
structure of the data is captured by solving a non-linear, 
low-rank representation problem. We show that this low- 
rank representation can be kernelized, has a closed-form 
solution, allows for separation of independent manifolds, 
and is robust to noise. From this representation, similarity 
between observations based on non-linear structure is 
computed and can be incorporated into existing feature 
transformations, dimensionality reduction techniques, and 
machine learning methods. Experimental results on both
synthetic and real data sets show performance improvements
for clustering and anomaly detection through the
use of structural similarity.

I. Introduction 

The notion of distance, or more generally, similarity 
between observations, is at the root of most learning 
algorithms such as manifold learning, unsupervised clus- 
tering, semi-supervised learning, and anomaly detection. 
Most methods, at a basic level, are based on some 
function of Euclidean distance, such as the radial basis 
functions prevalent in supervised classification. Graph- 
based learning methods employ Euclidean distances to 
describe local neighborhoods for observations. K-nearest
neighbors and ε-neighborhoods based on Euclidean distances
are used in manifold learning (Isomap [1] and
LLE [2]), spectral clustering [3], anomaly detection
algorithms [4], [5], and label propagation algorithms
for semi-supervised learning [6].

In many cases, notably sparsely sampled sets of data, 
Euclidean neighborhoods are not sufficient to represent 
underlying structure. Consider data drawn from two 
independent structures. Ideally, a graph would capture 
the structure of the data with minimal connection between
observations on separate structures. For densely
sampled data on the manifolds, as shown in Fig. 1, local
Euclidean neighborhoods tend to lie on the same independent
manifold, and therefore the K-nearest neighbor
graph using Euclidean distance captures the structure of
the data. However, when the data is sparsely sampled,
neighboring points in the Euclidean sense fail to lie on
the same structure, as shown in Fig. 2. We propose a new
notion of similarity that accounts for global structure as
well as local Euclidean neighborhoods. By using this
new notion of similarity, we can define neighborhoods
dependent on both Euclidean distance and structural
similarity.

Fig. 1. K-nearest neighbor graph constructed on densely-sampled 2-D
simulated data set. With densely sampled data points, the graph is
representative of the underlying structure.




Fig. 2. K-nearest neighbor graph constructed on sparsely-sampled 2-D
simulated data set. The graph is not representative of the underlying
structure of the data, as Euclidean neighborhoods include points lying
on both independent structures.

In many situations observations are well described by 
a single or a union of multiple low-dimensional mani- 
folds. Most methods deal with this scenario by patching 



together local neighborhoods or local linear subspaces. 
These local neighborhoods are generally based on local
Euclidean balls around each data sample. When
observations are sparsely sampled, are noisy, or
lie where the manifold has high curvature, these
local approximations can be inaccurate and may not
characterize the underlying manifold. Sparse sampling
commonly arises in high dimensions, and measurement
noise is common to many signal processing applications.

Unfortunately, Euclidean distance does not capture 
low-dimensional structure of observations unless the 
manifold is highly sampled. We propose a new approach 
to more accurately describe local neighborhoods by 
explicitly incorporating low-dimensional manifold struc- 
ture of data. We present a complementary method of 
incorporating similarity into existing machine learning 
techniques. Our approach can be used in conjunction 
with dimensionality reduction, feature transformation, 
and kernel selection techniques to incorporate structure. 

In order to capture the low-dimensional structure of 
the data, we first solve a regularized low-rank representation
problem on the observed data. We derive a
computationally efficient closed-form solution,
which allows for handling large sets of observations.
The resulting solution produces a low rank matrix, Z. 
Each column of the matrix, Z, is associated with a 
data sample and so the columns of the matrix represent 
a transformation of data into a new coordinate space. 
We show that these techniques can be extended to non- 
linear settings by demonstrating that the low-rank rep- 
resentation problem can be kernelized. We then present 
algorithms for estimating low rank representation for test 
samples that conform with the representation for training 
data. 

The problem of determining an approximate representation
for data using low-rank subspaces has recently
drawn significant interest in the field of matrix
completion problems [7], [8]. Methods for low-rank
representations (LRR) of data drawn from multiple sources
belonging to a union of subspaces have also been
developed [9], [10], [11]. Low-rank representations seek to
segment data vectors such that each segmented collection
belongs to a low-dimensional linear subspace. Low-rank
representation of data is related to traditional simultaneous
sparse representation techniques [12], with important
differences. The objective in simultaneous sparsity is to
decompose data vectors so that they have a common
basis in a dictionary. In both the rank-minimization
and simultaneous sparsity problems, the goal is
representation of data subject to a structural constraint. In
comparison, we are not interested in exact representation



of observations, but instead in embedding points in a 
linear plane, with the notion of low-rank structures as 
a means of defining neighborhoods. In particular, our 
resulting minimization resembles the problem posed by 
Liu et al. [9]. However, the problem posed in this paper
can be solved in a computationally efficient manner, 
extended to nonlinear manifolds, and analyzed in the 
presence of noise. 

We theoretically show that the resulting low rank 
matrix Z has a block diagonal structure, with each 
block having low rank. In the linear setting, this implies 
that the data space is automatically decomposed into 
multiple linear/affine patches. Decomposition into non- 
linear patches follows from kernelized extensions. This 
new representation is purely geometric, applies to single 
or multiple manifolds, and does not require specifying a 
metric on the manifold. It is adaptive in that it does not 
require pre-specification of the number of data points for 
each patch. It is global in that the matrix Z is obtained 
by solving the low-rank representation on the entire data 
set. This block diagonal structure is essentially preserved 
in noisy situations or when the underlying manifold can 
only be approximated by linear or kernel representations. 

This new representation Z can be used in several 
ways. It can be used as samples from a new low- 
dimensional feature/observation space. Nearest neigh- 
bors for data samples or similarity between different data 
samples can be derived in this new representation. Alter- 
natively, structural similarity can be computed between 
observations. Consequently, this new representation can 
be viewed as a pre-processing step not only for most 
machine learning algorithms, but also as a pre-processing 
step for other pre-processing steps, such as graph con- 
struction, that are commonly undertaken for machine 
learning. We then present a number of simulations on 
a wide variety of data sets for a wide range of prob- 
lems including clustering, semi-supervised learning and 
anomaly detection. The simulations show remarkable 
improvement in performance over conventional methods. 

II. Structured Similarity and Neighborhoods 

A. Low-Rank Data Transformation 

Consider a set of observations, $X = [x_1, x_2, \ldots, x_n]$,
where $x_i \in \mathbb{R}^{d \times 1}$, approximately embedded on multiple
independent, low-dimensional manifolds. Our goal is
to discover these manifolds by using techniques
to learn low-rank representations of the data. In the
case where the observations are embedded on linear
subspaces, the low-rank representation (LRR) problem
can be formulated as:

\[ \min_{Z} \|X - XZ\|_F^2 \quad \text{s.t.} \quad \mathrm{Rank}(Z) = R \tag{1} \]



where $\|\cdot\|_F$ is the Frobenius norm, and the solution,
$Z$, is the minimum squared-error linear embedding on an
$R$-dimensional subspace. Relaxing the constraint in (1),
the minimization can be equivalently written:

\[ \min_{Z} \|X - XZ\|_F^2 + \lambda \, \mathrm{Rank}(Z) \tag{2} \]



Optimizing the rank of a matrix is a non-convex, com- 
binatorial optimization. The convex relaxation of rank, 
the nuclear norm, is substituted, resulting in the convex 
optimization: 
This related problem was originally posed as a subspace
segmentation method by Liu et al. [9], who minimize the
$\ell_{2,1}$ embedding error. A Kernelized Low-Rank Representation
(KLRR) formulation of the problem naturally
follows for the case where data is embedded on nonlinear
subspaces:

\[ \min_{Z} \frac{1}{2}\|\phi(X) - \phi(X)Z\|_F^2 + \lambda\|Z\|_* \tag{4} \]
where $\phi(\cdot)$ is an expanded basis function with an associated
kernel function, $K(i,j) = \phi(i)^T \phi(j)$. The form
and parameters of the function $\phi(\cdot)$ are an assumption on
the structure of the observations. Ideally, $\phi(\cdot)$ is chosen
such that all observations are well approximated in the
expanded basis space with a linear low-dimensional
approximation while still maintaining the relationship
between observations. As in all kernel methods, the accuracy
of the approximation of the manifold is dependent
on the ability of the kernel to fit the data. We refer only
to the kernelized problem, as the linear problem is a
specific case, where $\phi(X) = X$ and $K(X,X) = X^T X$.

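Theorem 1. For $\lambda > 0$, the KLRR problem (4) is
minimized by the representation:

\[ Z^* = U D_\lambda U^T \tag{5} \]

where the singular vectors, $U$, are found by the singular
value decomposition of the kernel matrix $K(X,X) =
\phi(X)^T \phi(X) = U D U^T$, and $D_\lambda$ is the diagonal matrix
defined

\[ D_\lambda(i,i) = \begin{cases} 1 - \dfrac{\lambda}{\sigma_i} & \text{if } \sigma_i > \lambda \\ 0 & \text{otherwise} \end{cases} \tag{6} \]

and $\sigma_i$ is the $i$th singular value of the kernel matrix.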
Proof: The nuclear norm is unitarily invariant, so
substituting $\tilde{Z} = U^T Z U$, where $U$ is the matrix of singular
vectors of the kernel matrix, produces an equivalent
minimization:

\[ U^T Z^* U = \arg\min_{\tilde{Z}} \frac{1}{2}\|V D^{1/2} U^T - V D^{1/2} \tilde{Z} U^T\|_F^2 + \lambda\|\tilde{Z}\|_* \tag{7} \]

where $V D^{1/2} U^T$ is the singular value decomposition of
the matrix $\phi(X)$, so that $\phi(X)^T \phi(X) = U D U^T$. The
Frobenius norm is also unitarily invariant, and therefore
pre-multiplying and post-multiplying the argument of the
Frobenius norm by the matrices $V^T$ and $U$, respectively,
produces the following minimization:

\[ U^T Z^* U = \arg\min_{\tilde{Z}} \frac{1}{2}\|D^{1/2} - D^{1/2}\tilde{Z}\|_F^2 + \lambda\|\tilde{Z}\|_* \tag{8} \]

In the above minimization, $D$ is a diagonal matrix,
resulting in $\tilde{Z}$ also being a diagonal matrix, as any
off-diagonal elements will increase the value of the cost
function in both the Frobenius norm and nuclear norm
terms. With a diagonal structure, the Frobenius norm is
equivalent to the $\ell_2$ norm of its weighted diagonal terms,
and the nuclear norm of $\tilde{Z}$ is equivalent to the $\ell_1$ norm
on its diagonal.

\[ U^T Z^* U = D_\lambda = \arg\min_{\tilde{Z}} \frac{1}{2}\|\mathrm{diag}(D^{1/2}(I - \tilde{Z}))\|_2^2 + \lambda\|\mathrm{diag}(\tilde{Z})\|_1 \tag{9} \]

The problem separates over the diagonal entries: each entry
$\tilde{z}_i$ minimizes $\frac{1}{2}\sigma_i(1 - \tilde{z}_i)^2 + \lambda|\tilde{z}_i|$, which gives
$\tilde{z}_i = 1 - \lambda/\sigma_i$ when $\sigma_i > \lambda$ and $\tilde{z}_i = 0$ otherwise.
The solution to this minimization is therefore given by the
soft-thresholding operator, which gives the closed-form
solution in the theorem statement. ■

The closed-form solution to the KLRR problem has
previously been shown in [11], with the closed-form
solution to the linear low-rank representation problem
shown in [13].

From Theorem 1, low-rank representations of high-dimensional
expanded basis space observations can be
computed efficiently using only kernel functions.
Additionally, this solution has near block-diagonal structure
for sets of observations existing on independent
subspaces in the expanded basis space.
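
As an illustration, the closed-form solution of Theorem 1 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names `klrr_closed_form` and `klrr_objective` are hypothetical, the kernel matrix is assumed symmetric positive semi-definite (so its eigendecomposition coincides with its SVD), and the objective is taken in the form of (4) with the factor 1/2.

```python
import numpy as np


def klrr_closed_form(K, lam):
    """Closed-form KLRR, Z = U D_lam U^T, for a PSD kernel matrix K (hypothetical helper)."""
    K = 0.5 * (K + K.T)               # guard against round-off asymmetry
    sigma, U = np.linalg.eigh(K)      # for PSD K, eigenvalues are the singular values
    # Soft-threshold the spectrum: 1 - lam/sigma_i where sigma_i > lam, else 0.
    keep = sigma > lam
    d_lam = np.zeros_like(sigma)
    d_lam[keep] = 1.0 - lam / sigma[keep]
    return (U * d_lam) @ U.T          # scales column i of U by d_lam[i]


def klrr_objective(K, Z, lam):
    """0.5 * ||phi(X) - phi(X) Z||_F^2 + lam * ||Z||_*, evaluated through K only."""
    I = np.eye(K.shape[0])
    fit = 0.5 * np.trace((I - Z).T @ K @ (I - Z))
    return fit + lam * np.linalg.norm(Z, ord="nuc")
```

For the linear case one would pass `K = X.T @ X`; only one eigendecomposition of the n x n kernel matrix is required, which is what makes the representation inexpensive to compute.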

Theorem 2. Given observations lying on independent
subspaces in the expanded basis space, the low-rank
representation is near block-diagonal, with elements
off-block-diagonal bounded:

\[ Z_{ij} \le \lambda \sqrt{\sum_{k : \sigma_k > \lambda} \frac{1}{\sigma_k^2}} \tag{10} \]

where $x_i$ and $x_j$ lie on independent manifolds and $\sigma_k$ is the $k$th singular value of the kernel matrix.
Proof: Consider the case of $\phi(X) = [\phi(X_1) \; \phi(X_2)]$,
where $\phi(X_1) \in \mathbb{R}^{D \times n_1}$ and
$\phi(X_2) \in \mathbb{R}^{D \times n_2}$ span independent subspaces and have
rank $r_1$ and $r_2$, with $r_1 + r_2 < D$ and $\mathrm{Rank}(\phi(X)) = r_1 + r_2$.

$\phi(X)$ is composed of observations lying on independent
subspaces and therefore any point in $\phi(X_1)$
or $\phi(X_2)$ can be expressed as a linear combination of
other points in the sets $\phi(X_1)$ or $\phi(X_2)$, respectively.
Given the compact SVD, $\phi(X) = U_X \Sigma_X V_X^T$, where
only singular vectors associated with non-zero singular
values are included in the bases $U_X$ and $V_X$, the linear
transformation:

\[ W = \Sigma_X^{-1} U_X^T \phi(X) = V_X^T \tag{11} \]

preserves the dependence structure of $\phi(X)$, resulting in
the decomposition:

\[ W = [B_1 \; B_2] \begin{bmatrix} a_1 & 0 \\ 0 & a_2 \end{bmatrix} \tag{12} \]

where $B_1 \in \mathbb{R}^{(r_1 + r_2) \times r_1}$ and $B_2 \in \mathbb{R}^{(r_1 + r_2) \times r_2}$ are
basis matrices and $a_1 \in \mathbb{R}^{r_1 \times n_1}$ and $a_2 \in \mathbb{R}^{r_2 \times n_2}$
are the representations associated with each independent
subspace. Substituting the singular value decompositions
$a_1 = U_1 \Sigma_1 V_1^T$ and $a_2 = U_2 \Sigma_2 V_2^T$, $W$ can be
expressed:

\[ W = B' \begin{bmatrix} V_1^T & 0 \\ 0 & V_2^T \end{bmatrix} \tag{13} \]

where $B' = [B_1 U_1 \Sigma_1 \;\; B_2 U_2 \Sigma_2]$. The matrix $W$ is a
set of singular vectors, so the following property must
hold:

\[ W W^T = B' B'^T = I \tag{14} \]

Therefore, $B'$ is an orthonormal matrix.

The solution to the KLRR problem can be expressed 
(15)

The term $V_X V_X^T$ is block diagonal, so the inner product
between representations lying on separate independent
subspaces can be bounded by bounding the
off-block-diagonal elements:



< r x I 



< 



F 

(16) 



From this structure, we construct measures of similar- 
ity that result in large distance between observations on 



separate manifolds independent of Euclidean distance, 
resulting in separation of independent manifolds. 
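
The near block-diagonal structure can be checked numerically. The following is a small sketch, assuming a linear kernel and two randomly generated independent subspaces; the helper names `klrr` and `off_block_report` and all default sizes are hypothetical. The bound it evaluates, $\lambda \sqrt{\sum_k 1/\sigma_k^2}$ over the retained singular values, follows from writing $Z = V_X V_X^T - \lambda \sum_k \sigma_k^{-1} u_k u_k^T$ and bounding the second term by its Frobenius norm; this derivation assumes every non-zero singular value of the kernel matrix exceeds $\lambda$, and it is offered as our reading rather than as a restatement of the proof.

```python
import numpy as np


def klrr(K, lam):
    # Closed-form solution of Theorem 1 for a PSD kernel matrix.
    sigma, U = np.linalg.eigh(0.5 * (K + K.T))
    d = np.where(sigma > lam, 1.0 - lam / np.maximum(sigma, lam), 0.0)
    return (U * d) @ U.T, sigma


def off_block_report(n1=15, n2=15, dim=10, r1=2, r2=2, lam=0.01, seed=0):
    """Return (max off-block |Z_ij|, assumed bound, max within-block |Z_ij|)."""
    rng = np.random.default_rng(seed)
    # Two independent linear subspaces; with a linear kernel, phi(X) = X.
    X1 = rng.standard_normal((dim, r1)) @ rng.standard_normal((r1, n1))
    X2 = rng.standard_normal((dim, r2)) @ rng.standard_normal((r2, n2))
    X = np.hstack([X1, X2])
    Z, sigma = klrr(X.T @ X, lam)
    off_block_max = np.abs(Z[:n1, n1:]).max()
    kept = sigma[sigma > lam]                       # singular values above the threshold
    bound = lam * np.sqrt(np.sum(1.0 / kept**2))    # assumes all non-zero sigma exceed lam
    within_block_max = np.abs(Z[:n1, :n1]).max()
    return off_block_max, bound, within_block_max
```

Calling `off_block_report()` returns the largest cross-subspace entry alongside the bound and the largest same-subspace entry, so the separation between the two can be inspected directly for different values of `lam`.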

For reliable performance, the near block-diagonal 
structure of the KLRR matrix should remain in the 
presence of small perturbations. To demonstrate the 
robustness to noise, we bound the Frobenius norm of 
the off-diagonal elements when the kernel matrix is 
perturbed, guaranteeing near block-diagonal structure in 
the presence of noise. 

Theorem 3. Consider a perturbed kernel matrix

\[ \tilde{K}(X,X) = K(X,X) + E \tag{17} \]

where $K(X,X)$ has rank $r$ and is composed of two independent
subspaces. The perturbed KLRR, $\tilde{Z}$, is found
using the matrix $\tilde{K}(X,X)$. Define the matrix $N$ to be
the off-diagonal blocks of the matrix $\tilde{Z}$, such that for all
$\tilde{Z}_{ij} \in N$, the observations $x_i$ and $x_j$ lie on independent
manifolds. Then the matrix $N$ is bounded:

4\/2||£|| F 



IliVll 



< 



(18) 



where $\sigma_e$ is the largest singular value of the matrix $E$.

The proof of this theorem is omitted due to length
limitations. The bound is derived by bounding the canonical
angle between perturbed eigenvectors, using Theorem V.4.1
of Stewart and Sun [14].

Theorem 3 illustrates the robustness to noise of the
low-rank representation. This is an adversarial bound,
with no restrictions placed on the structure of
the perturbations. Given small perturbations such that
$\sigma_e \in o(\sigma_r)$, the norm of the off-diagonal elements can
be bounded linearly with the norm of the perturbation.
Therefore, small perturbations in the observations have
small effects on the structure of the KLRR representation.

B. Transformation for Test Observations 

Consider a new observation, $x_{test} \in \mathbb{R}^{d \times 1}$. In order
to represent the new observation in the low-rank representation
space, we extend the KLRR formulation from
Equation (4). We project the data onto the expanded
basis span of the KLRR representation, $\phi(X)Z$, where
$Z$ is the KLRR on the training set $X$. The minimum
norm projection can be calculated as a function of kernel
functions.

\[ z_{test} = Z \left(Z^T K(X,X) Z\right)^{\dagger} Z^T K(X, x_{test}) \tag{19} \]



The representation, $z_{test}$, can be treated as a new sample
from the low-rank feature space. This representation may
be a poor representation of the original observation if it
does not lie on the manifold. One measure of how well
a new observation is characterized by a manifold is by
projecting the new observation onto the low-dimensional
manifold [15], then measuring the residual energy of the
observation in the expanded basis space.



\[ r_{test} = \|\phi(x_{test}) - \phi(X) z_{test}\| \tag{20} \]



This residual can also be calculated using only kernel 
functions and provides an evaluation of how well the 
low-rank representation fits a new observation. 
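
The test-point representation and its residual can be sketched using kernel evaluations only. This is an illustrative sketch, not the authors' code: the names `klrr` and `project_test_point` and the cutoff `rcond` are hypothetical, and the inverse in (19) is assumed to be a pseudoinverse because $Z^T K(X,X) Z$ is rank deficient when $Z$ is low rank. The residual uses the expansion $\|\phi(x) - \phi(X)z\|^2 = K(x,x) - 2K(X,x)^T z + z^T K(X,X) z$.

```python
import numpy as np


def klrr(K, lam):
    # Closed-form solution of Theorem 1 for a PSD kernel matrix.
    sigma, U = np.linalg.eigh(0.5 * (K + K.T))
    d = np.where(sigma > lam, 1.0 - lam / np.maximum(sigma, lam), 0.0)
    return (U * d) @ U.T


def project_test_point(K, Z, k_x, k_xx, rcond=1e-10):
    """Return (z_test, residual) for a single test observation.

    K    : n x n training kernel matrix K(X, X)
    Z    : n x n KLRR of the training set
    k_x  : length-n vector K(X, x_test)
    k_xx : scalar K(x_test, x_test)
    """
    G = Z.T @ K @ Z
    # Pseudoinverse with an explicit cutoff (an assumption of this sketch).
    z_test = Z @ (np.linalg.pinv(G, rcond) @ (Z.T @ k_x))
    # Squared residual in the expanded basis space, via kernel values only.
    r2 = k_xx - 2.0 * (k_x @ z_test) + z_test @ K @ z_test
    return z_test, float(np.sqrt(max(r2, 0.0)))
```

With a linear kernel, `k_x = X.T @ x_test` and `k_xx = x_test @ x_test`; a point lying in the span of the training data then yields a residual near zero, while a point with a component orthogonal to that span yields a residual equal to the length of that component.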

C. Structured Kernel Design for Supervised and Unsu- 
pervised Learning 

From the low-rank representation, we now present 
methods of constructing kernels. The low-rank transfor- 
mation of the raw data offers possibilities for designing 
kernels that incorporate the underlying structure in the 
data. 

In order to exploit this structure we consider some
specific PSD kernels, which are basically the dot product,
i.e.,

\[ w_{ij} = \frac{|\langle z_i, z_j \rangle|}{\|z_i\| \, \|z_j\|} \tag{21} \]

where $w_{ij}$ is the similarity between observations $i$ and
$j$, and $z_i$ and $z_j$ are the $i$th and $j$th columns of $Z$,
respectively. The value $w_{ij}$ is the magnitude of the cosine
of the angle between the vectors $z_i$ and $z_j$. Given the
near block-diagonal structure of the KLRR matrix, as
shown in Theorem 2, observations lying on independent
subspaces have a very small similarity.

One issue with this similarity function is that it is
undefined if either $z_i$ or $z_j$ is identically zero.
Consequently, we define $w_{ij} = 0$ if either $z_i$ or $z_j$ is
zero. With this convention it is possible to show that
this similarity satisfies the properties of a PSD kernel, i.e.,

\[ K(z_i, z_j) = z_i^T z_j \, \frac{1}{\|z_i\|} \, \frac{1}{\|z_j\|} = K_1(z_i, z_j) \, g(z_i) \, g(z_j) = K_1(z_i, z_j) \, K_2(z_i, z_j) \]

$K_2(z_i, z_j) = g(z_i) g(z_j)$ is a valid PSD kernel (since it
represents a rank one structure). Now since both $K_1$ and
$K_2$ are valid PSD kernels, it follows that $K$ is a valid
PSD kernel. The similarity proposed in (21) captures
the structure of the observations; however, the scaling
information is lost. In order to incorporate structural
information while preserving spatial relationships in the
observation space, we propose the PSD kernel:

\[ s_{ij} = \frac{|\langle z_i, z_j \rangle|}{\|z_i\| \, \|z_j\|} \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \tag{22} \]

Two observations only have a large similarity if the
observations lie on the same manifold and have a small
geometric distance. If $x_i$ and $x_j$ lie on independent
manifolds, from the structure of the KLRR matrix, the
cosine of the angle between the representations is small, and
therefore the similarity, $s_{ij}$, is also small. Alternatively, if the
observations lie on the same low-dimensional manifold,
but have a large geometric distance, the exponential term
drives the similarity to a small value.

Fig. 3. Graph for densely sampled 2-D simulated data set constructed
by connecting the K-nearest structurally similar neighbors as defined
by (22).




Fig. 4. Graph constructed by connecting the K-nearest structurally
similar neighbors as defined by (22), which captures the structure of
the data. The Euclidean K-nearest neighbor graph fails to capture the
structure of the data, as shown in Fig. 2.



Figures 3 and 4 demonstrate the effect of using
structural similarity as opposed to Euclidean distance on
the same sets of data as presented in Figs. 1 and 2. In the
case of densely sampled manifolds, both notions
of similarity produce graphs that capture the underlying
structure of the data, as shown in Fig. 3. However,
when the data is sparsely sampled, the use of structural
similarity allows the graph to accurately characterize the
underlying structure of the data, as shown in Fig. 4.

From the definition of similarity posed in (22), a
means of defining distance between observations follows:

\[ d(x_i, x_j) = \sqrt{s_{ii} + s_{jj} - 2 s_{ij}} \tag{23} \]



The metric (23) defines a new set of distances between
observations combining both the structural similarity of
the data as well as the Euclidean distance of the data.
Two observations have a small distance if and only if the
observations lie on the same low-dimensional manifold
and have a small distance in the observation space.
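
The similarity measures and the induced distance of this section can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the Gaussian form of the exponential term and its `bandwidth` parameter are assumptions (the text only states that an exponential term penalizes large geometric distance), and the distance is the standard kernel-induced form $\sqrt{s_{ii} + s_{jj} - 2 s_{ij}}$.

```python
import numpy as np


def cosine_similarity_matrix(Z):
    """w_ij = |<z_i, z_j>| / (||z_i|| ||z_j||) over columns of Z; 0 if a column is zero."""
    norms = np.linalg.norm(Z, axis=0)
    safe = np.where(norms > 0, norms, 1.0)     # avoid division by zero
    W = np.abs(Z.T @ Z) / np.outer(safe, safe)
    W[norms == 0, :] = 0.0                     # convention for identically zero columns
    W[:, norms == 0] = 0.0
    return W


def structured_kernel(Z, X, bandwidth):
    """s_ij = w_ij * exp(-||x_i - x_j||^2 / (2 bandwidth^2)); Gaussian form is assumed."""
    sq = np.sum(X**2, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    return cosine_similarity_matrix(Z) * np.exp(-d2 / (2.0 * bandwidth**2))


def structural_distance(S):
    """d_ij = sqrt(s_ii + s_jj - 2 s_ij) for a similarity (kernel) matrix S."""
    diag = np.diag(S)
    return np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2.0 * S, 0.0))
```

Here the columns of `X` are the observations and the columns of `Z` their low-rank representations, so `structural_distance(structured_kernel(Z, X, bandwidth))` yields a pairwise distance matrix that can be handed to any neighbor-graph construction in place of Euclidean distances.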

III. Manifold Anomaly Detection 

The goal of our anomaly detection scheme is to define
anomalous points not based on distance to nominal points, but
instead based on distance to a low-dimensional manifold
on which nominal points are embedded. In K-NNG and
ε-NNG approaches, the underlying manifold is modeled
by dense sampling of data points, whereas our approach
no longer requires dense sampling of data points, but
instead relies on structural assumptions on the data. For a set
of nominal observations embedded on a manifold, we
propose a method of anomaly detection based on p-value
estimation [4].

Given a set of nominal training observations, $X$,
a kernelized low-rank representation, $Z$, is found as
described in Section II-A. For a new test observation,
$x_t$, a corresponding low-rank representation, $z_t$, is found
through the update method described in Section II-B.
From these low-rank representations, the residual of the
test observation is compared to the residuals of the
labeled observations:



Pi 



1 ™ 



(24) 



i=i 



where $r_i$ is the residual of the $i$th labeled observation,
calculated as shown in (20), and $w_t$ and $w_i$ are the
average cosine similarities of the representations
as defined in (21). The test observation is declared
anomalous if $p_t > \alpha$. The proposed anomaly detection
characterizes the nominal set by a nonlinear low-dimensional
manifold and uses a measure of similarity
to the manifold to determine if test observations are
anomalous.
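
The empirical p-value computation can be sketched as below. This is a sketch under assumptions: the score $w e^{-r}$ is the function named in the proof of Theorem 5, but the function names and, in particular, the direction of the comparison inside the empirical average are assumptions of this example, so the thresholding rule should follow the definition of $p_t$ given in the text.

```python
import numpy as np


def similarity_score(w, r):
    """Score G = w * exp(-r) from average cosine similarity w and residual r."""
    return np.asarray(w, dtype=float) * np.exp(-np.asarray(r, dtype=float))


def empirical_p_value(train_scores, test_score):
    """Fraction of nominal training scores that do not exceed the test score.

    The comparison direction (<=) is an assumption of this sketch.
    """
    train_scores = np.asarray(train_scores, dtype=float)
    return float(np.mean(train_scores <= test_score))
```

Under this convention a test point that fits the manifold poorly (low similarity, large residual) receives a small score and therefore a small empirical p-value.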

We modify our algorithm to simplify analysis. Assuming
that $n$ is even, we divide the training set into two
sets $S_1$ and $S_2$. We compute the KLRR
for the set $S_1$ and compute representations for the
set $S_2$ and the test sample, $x_t$, as defined by (19).
The p-value of the new observation is then estimated as
follows:



Pt 



1 X ^ 

\sT\ 2-f 1 ^ e ~ rt - 



(25) 



ies-2 



The distribution of $p_t$ approaches a uniform distribution
over the range $[0, 1]$ given that $x_t$ is nominal
and drawn from the same distribution as the nominal
observations, $X$. This follows from the lemma given by
Zhao et al. [4]:

Lemma 4. Suppose a function $G(x)$ has the
nestedness property, that is, for any $t_1 > t_2$
we have $\{x : G(x) > t_1\} \subset \{x : G(x) > t_2\}$. Then
$P_{x \sim f_n}(G(x) > G(\eta))$ is uniformly distributed in $[0, 1]$
if $\eta \sim f_n$.

From this lemma, we can directly show that the
distribution of $p_t$ converges to a uniform distribution.

Theorem 5. For a nominal test point, $x_t$, drawn from
the same distribution as the labeled observations, $X$, $p_t$
converges to a uniformly distributed random variable in
the range $[0, 1]$.

Proof: This follows directly from Lemma 4, as the
function $G(x_i) = w_i e^{-r_i}$ has the nestedness property. ■

Therefore, the distribution of $p_t$ converges to a uniform
distribution as $n \to \infty$, and therefore the probability of false
alarm converges to $\alpha$.

IV. Experimental Results 
A. Clustering 

[Figure 5 plot. Title: Cosine Similarity Classification. Legend: Predicted Class 1, Predicted Class 2, Errors.]



Fig. 5. 2-D simulated data set. Two classes are constructed, a line and
circle, with 200 observations in each class and random Gaussian noise
added. Sample clustering results on simulated 2-D line-circle data set
are shown using k-means clustering on the structural similarity defined
in Equation (21).



To evaluate performance, k-means clustering [18] was
performed on representations of the data, with the results
shown in Table I. The k-means clustering algorithm
was chosen as a means to compare data representations
due to its widespread use and lack of tuning parameters
to be optimized. Initialization was performed by assigning
observations to random clusters, with the error rates
and standard deviations found for 100 random initializations.
K-means clustering was tested on the data in the
original feature space and in the expanded basis space,
$\phi(X)$. For the simulated data, the expanded basis was
generated from a 3rd order inhomogeneous polynomial
kernel multiplied with a Gaussian RBF kernel. Note that
this kernel does not perfectly transform the data to linear
subspaces and therefore exact linear subspace recovery
methods cannot be applied to this transform. For the
Ionosphere, Iris, and JAFFE data sets, the expanded basis
space was generated by Gaussian RBF kernels.

Data Set     Classes   Observation    Kernel         Similarity (W)
Simulated    2         49.4 ± 0.1%    35.8 ± 4.9%    9.4 ± 0.1%
Ionosphere   2         28.8 ± 0.1%    28.9 ± 0.1%    22.7 ± 0%
Iris         3         17.3 ± 9.8%    15.3 ± 8.5%    7.6 ± 6.4%
JAFFE        10        29.5 ± 4.4%    25.5 ± 5.0%    12.7 ± 5.3%

TABLE I

Average clustering error rates and standard deviation
over 100 random initializations. Performance was
compared for different representations: the original
observation space (Observation), the expanded basis
space (Kernel), and the cosine similarity space defined in
(21) (Similarity (W)). For the JAFFE [16] and Iris
databases [17], these performance rates are comparable
to the best achieved results in literature and are
achieved using an extremely simple algorithm (k-means
clustering) on structured data.



[Figure 6 plot. Title: Large and Small Clusters. Legend: labeled nominal, unlabeled nominal, unlabeled anomalous.]



Fig. 6. Example of simulated clusters data. Training was performed on 
20 labeled nominal points (blue circles), and testing was performed on 
50 unlabeled nominal points (green dots) and 50 unlabeled anomalous 
points (red crosses). 



Data Set     Kernel          Similarity (W)   Struct. Kernel
Simulated    46.0 ± 0%       17.5 ± 0%        17.7 ± 0%
Ionosphere   35.9 ± 0%       22.5 ± 0%        22.8 ± 0%
Iris         15.9 ± 3.1%     5.2 ± 7.2%       4.5 ± 5.9%
JAFFE        17.14 ± 5.2%    13.5 ± 5.8%      13.7 ± 5.3%



TABLE II

Average spectral clustering error rates and standard
deviation over 100 random initializations. Performance
was compared for different measures of similarity: the
expanded basis space (Kernel), the cosine similarity
space defined in (21) (Similarity (W)), and the structured
kernel space defined in (22) (Struct. Kernel). For the
JAFFE [16] and Iris databases [17], these performance
rates are comparable to the best achieved results in
literature and are achieved using an extremely simple
algorithm (k-means clustering) on structured data.

Spectral clustering performance on an expanded basis
space is compared to similarity measures incorporating
structure in Table II. As with k-means clustering,
inclusion of structure improved clustering performance
in all example cases.
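
To make concrete how a structural similarity matrix can be consumed by spectral clustering, the following is a generic two-cluster sketch. The text does not specify which spectral clustering variant was used in the experiments, so the normalized Laplacian, the sign-based split, and the function name `spectral_bipartition` are all assumptions of this example.

```python
import numpy as np


def spectral_bipartition(S):
    """Split points into two clusters from a symmetric, non-negative similarity matrix S."""
    S = 0.5 * (S + S.T)
    deg = S.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # Symmetric normalized Laplacian L = I - D^{-1/2} S D^{-1/2}.
    L = np.eye(S.shape[0]) - inv_sqrt[:, None] * S * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)               # eigenvalues returned in ascending order
    fiedler = vecs[:, 1] * inv_sqrt           # second eigenvector, degree-normalized
    return (fiedler > 0).astype(int)          # sign of the entries gives the two clusters
```

Any of the similarity matrices discussed above (kernel, cosine similarity, or structured kernel) can be passed as `S`; for more than two clusters one would instead run k-means on several leading eigenvectors.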

B. Anomaly Detection 

We compare performance of the p-value estimation
technique with the K-nearest neighbors graph (K-NN)
method presented by Zhao et al. [4] and a One-Class
SVM [19]. We evaluate performance on simulated data
sets, the Ionosphere data set [20], the USPS Digits data
set [21], and the JAFFE data set [22].

The simulated clusters data set (Fig. 6) consists of nominal
data composed of two Gaussian distributions with different
variances, and anomalous data drawn from a uniform
distribution. 20 random nominal points were used to
train the classifier, and performance was measured on a
test set composed of 50 unobserved nominal points and
50 anomalous points, as shown in Fig. 6. A Gaussian
radial basis function kernel was used to approximate
the manifold, and performance was averaged over 100
randomly generated data sets, with average performance
shown in Fig. 7.

Fig. 7. ROC curves averaged over 100 randomly generated data
sets. Performance of p-value estimation using the KLRR residual is
compared to p-value estimation using a Euclidean neighborhood (2nd
nearest neighbor) and One-Class SVM with $\mu = 0.5$.

The simulated linear data set was constructed of points
generated from a linear subspace, with nominal points
having small random perturbations and anomalous points
having large perturbations. 20 random nominal points
were used to train the classifier, and performance was
measured on a test set composed of 50 unobserved
nominal points and 50 anomalous points, as shown in
Fig. 8. 100 random data sets were generated, with
average performance shown in Fig. 9. A linear low-rank
representation of the labeled points was used to
approximate the manifold.
[Figure 8 plot. Title: Sparse Linear Data. Legend: labeled nominal points, unlabeled nominal points, unlabeled anomalous points.]



Fig. 8. Example of simulated linear data. Training on 20 labeled 
nominal points (blue circles), testing on 50 unlabeled nominal points 
(green dots) and 50 unlabeled anomalous points (red crosses). 



[Figure 9 plot. Title: Sparse Linear Dataset. Curves: KLRR, K-NN Method, One Class SVM. Horizontal axis: False Positive Rate.]



Fig. 9. ROC curves averaged over 100 randomly generated data
sets. Performance of p-value estimation using the KLRR residual is
compared to p-value estimation using a Euclidean neighborhood (2nd
nearest neighbor) and One-Class SVM with $\mu = 0.5$.



[Figure 10 plot. Title: Ionosphere Dataset. Curves: KLRR, K-NN Method, One Class SVM. Horizontal axis: False Positive Rate.]

Fig. 10. ROC curve using the proposed algorithm on the Ionosphere
dataset. The ROC curve was generated by averaging results over 100
random sets. Performance using the KLRR residual and Euclidean
distance (3rd nearest neighbor) for p-value estimation are shown, as
well as the performance of a One-Class SVM with $\mu = 0.5$.



For the Ionosphere data set [20], 175 observations
were labeled as nominal observations (drawn from the
set which show evidence of structure in the ionosphere)
and 30 observations were unlabeled for use as test data
(drawn from both the "good" and "bad" observations).
A Gaussian radial basis function kernel was used, and
performance was compared to anomaly detection using
a K-nearest neighbor graph and a One-Class SVM, as
shown in Figure 10.

[Figure 11 plot. Curves: KLRR, K-NN, One Class SVM. Horizontal axis: False Positive Rate.]

Fig. 11. ROC curve using the proposed algorithm on the JAFFE
dataset. The ROC curve was generated by averaging results over 100
random sets. Performance using the KLRR residual and Euclidean
distance (3rd nearest neighbor) for p-value estimation are shown, as
well as the performance of a One-Class SVM with $\mu = 0.5$.

For the JAFFE data set, 50 labeled nominal images
were chosen from 3 random individuals (defined as
nominal individuals) to construct the classifier. The test
set was composed of 15 unobserved images randomly
drawn from the nominal individuals and 100 anomalous
images drawn from the other individuals. The performance
using the KLRR residual was compared to the
use of a K-nearest neighbor graph for p-value estimation
[4] and a One-Class SVM [19]. For the USPS Digits
data set, 200 nominal images (the digit 8) were labeled,
with 167 unlabeled images randomly drawn from the
unobserved nominal images and 33 anomalous images
drawn from the other digits. A Gaussian RBF was used
to find the low-rank representation for both the USPS
and JAFFE data sets, and the same kernel functions
were used in the One-Class SVM. Performance was
averaged over 100 randomly assigned data sets for all
experiments, with performance shown in Fig. 11 and
Fig. 12 for the JAFFE and USPS data sets, respectively.
Use of the KLRR residual energy improved classification
performance for simulated and real-world data sets. The
ROC curves for the proposed method lie above the ROC
curves for either the K-nearest neighbor method or the
One-Class SVM, indicating that the underlying nominal
distribution likely lies on a low-dimensional manifold,
and this low-dimensional structure is well approximated
by the representation $Z$.



[Figure 12 plot. Title: USPS Dataset. Curves: KLRR, K-NN Method, One Class SVM. Horizontal axis: False Positive Rate.]



Fig. 12. ROC curve on the USPS digits data set generated by
averaging results over 100 random sets of labeled and unlabeled
points. Performance using the KLRR residual and Euclidean distance
(9th nearest neighbor) for p-value estimation are shown, as is the
performance of a One-Class SVM with $\mu = 0.5$.